Scalable Hypergraph Processing
Authors
Abstract
A hypergraph allows a hyperedge to connect more than two vertices. By using hyperedges to capture high-order relationships, many hypergraph learning algorithms have been shown to be highly effective in various applications. When learning on large hypergraphs, a common approach is to convert them to graphs so that distributed graph frameworks can be employed, yet this causes major efficiency drawbacks: an inflated problem size, excessive replicas, and unbalanced workloads. To avoid these drawbacks, we take a different approach and propose HyperX, a thin layer built upon Spark. To preserve the problem size, HyperX operates directly on a distributed hypergraph. To reduce replicas, HyperX replicates the vertices but not the hyperedges. To balance the workloads, we investigate the hypergraph partitioning problem, aiming to minimize the space and communication cost subject to two separate constraints on the hyperedge and vertex workloads. With experiments on both real and synthetic datasets, we verify that HyperX significantly improves the efficiency of the learning algorithms compared with the graph conversion approach.
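The "inflated problem size" of graph conversion can be illustrated with a small sketch. The abstract does not say which conversion is meant; the code below assumes clique expansion (each hyperedge replaced by edges between all pairs of its vertices), a common choice, and is not HyperX code. The function names `clique_expansion` and `incidence_count` are hypothetical, introduced only for this illustration.

```python
from itertools import combinations


def clique_expansion(hyperedges):
    """Replace each hyperedge with a clique over its vertices (assumed conversion).

    Returns the set of undirected pairwise edges of the resulting graph.
    """
    edges = set()
    for members in hyperedges:
        # Sort the distinct vertices so each undirected pair has one canonical form.
        for u, v in combinations(sorted(set(members)), 2):
            edges.add((u, v))
    return edges


def incidence_count(hyperedges):
    """Count (vertex, hyperedge) incidences: the size of the hypergraph as stored directly."""
    return sum(len(set(members)) for members in hyperedges)


# A toy hypergraph: three hyperedges over seven vertices.
hyperedges = [
    ["a", "b", "c", "d", "e"],
    ["d", "e", "f"],
    ["f", "g"],
]
graph_edges = clique_expansion(hyperedges)
print(incidence_count(hyperedges), len(graph_edges))

# A single large hyperedge shows the quadratic growth more clearly.
big = [list(range(100))]
print(incidence_count(big), len(clique_expansion(big)))
```

Under this assumption, a hyperedge with k vertices contributes k incidences to the hypergraph but k(k-1)/2 edges to the converted graph, so large hyperedges dominate the size of the converted graph.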
Similar resources
Enabling Scalable Social Group Analytics via Hypergraph Analysis Systems
With the rapid growth of large online social networks, the ability to analyze large-scale social structure and behavior has become critically important, and this has led to the development of several scalable graph processing systems. In reality, social interaction takes place not just between pairs of individuals as in the common graph model, but rather in the context of multi-user groups. Res...
Social Hash Partitioner: A Scalable Distributed Hypergraph Partitioner
We design and implement a distributed algorithm for balanced k-way hypergraph partitioning that minimizes fanout, a fundamental hypergraph quantity also known as the communication volume and (k − 1)-cut metric, by optimizing a novel objective called probabilistic fanout. This choice allows a simple local search heuristic to achieve comparable solution quality to the best existing hypergraph par...
HYDRA: HYpergraph-Based Distributed Response-Time Analyzer
It is important for almost all transaction processing and computer-communication systems to satisfy response time quantile targets. This paper describes HYDRA, a scalable parallel tool for the analytical determination of response time densities in large, structurally-unrestricted Markov models derived from high-level specifications. The tool exploits an efficient distributed uniformization-base...
Hypergraph Partitioning for Faster Parallel PageRank Computation
The PageRank algorithm is used by search engines such as Google to order web pages. It uses an iterative numerical method to compute the maximal eigenvector of a transition matrix derived from the web’s hyperlink structure and a user-centred model of web-surfing behaviour. As the web has expanded and as demand for user-tailored web page ordering metrics has grown, scalable parallel computation ...
Similarity analysis with advanced relationships on big data
Similarity analytic techniques such as distance-based joins and regularized learning models are critical tools employed in numerous data mining and machine learning tasks. We focus on two typical techniques in the context of large-scale data and distributed clusters. Advanced distance metrics such as the Earth Mover’s Distance (EMD) are usually employed to capture the similarity between data dim...